This bike-sharing dataset (hour.csv) was obtained from the UCI Machine Learning Repository. The following description is extracted and adapted from the included "Readme.txt":
Bike-sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has been automated. Through these systems, a user can easily rent a bike at one location and return it at another. There are currently over 500 bike-sharing programs around the world, comprising more than 500 thousand bicycles, and there is great interest in these systems due to their important role in traffic, environmental, and health issues.
The bike-sharing rental process is highly correlated with environmental and seasonal conditions: weather, precipitation, day of week, season, hour of the day, and so on all affect rental behavior. The core dataset is the two-year historical log for 2011 and 2012 from the Capital Bikeshare system in Washington, D.C.
hour.csv
- instant: record index
- dteday : date
- season : season (1: spring, 2: summer, 3: fall, 4: winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : whether the day is a holiday or not
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.
+ weathersit :
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided by 41 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)
- casual: count of bikes rented by casual users
- registered: count of bikes rented by registered users
- cnt: count of total bike rentals (casual + registered)
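Since the four continuous fields are min-max scaled, their original units can be recovered by multiplying back by the stated maxima. A minimal sketch (`denormalize` is a hypothetical helper, not part of the dataset):

```r
# Undo the normalization: temp was divided by 41, atemp by 50,
# hum by 100, windspeed by 67.
denormalize <- function(x, max_val) x * max_val

# Sample normalized temp values from hour.csv, back to degrees Celsius:
temp_c <- denormalize(c(0.24, 0.22), 41)
temp_c  # 9.84 9.02
```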
Task: predict the hourly bike rental count from the environmental and seasonal settings.
Let’s read the input and take a look at its structure.
# Read
hour.data <- read.csv("hour.csv", header = TRUE, stringsAsFactors = FALSE)
# Overview
str(hour.data)
## 'data.frame': 17379 obs. of 17 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : chr "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...
head(hour.data)
## instant dteday season yr mnth hr holiday weekday workingday
## 1 1 2011-01-01 1 0 1 0 0 6 0
## 2 2 2011-01-01 1 0 1 1 0 6 0
## 3 3 2011-01-01 1 0 1 2 0 6 0
## 4 4 2011-01-01 1 0 1 3 0 6 0
## 5 5 2011-01-01 1 0 1 4 0 6 0
## 6 6 2011-01-01 1 0 1 5 0 6 0
## weathersit temp atemp hum windspeed casual registered cnt
## 1 1 0.24 0.2879 0.81 0.0000 3 13 16
## 2 1 0.22 0.2727 0.80 0.0000 8 32 40
## 3 1 0.22 0.2727 0.80 0.0000 5 27 32
## 4 1 0.24 0.2879 0.75 0.0000 3 10 13
## 5 1 0.24 0.2879 0.75 0.0000 0 1 1
## 6 2 0.24 0.2576 0.75 0.0896 0 1 1
Split the data into training and testing sets by day of month: records from days 1-21 go to training, days 22 and later to testing.
train <- hour.data[as.integer(substr(hour.data$dteday,9,10)) < 22, ]
test <- hour.data[as.integer(substr(hour.data$dteday,9,10)) > 21, ]
# Training: 69.2%
nrow(train)/ nrow(hour.data)
## [1] 0.6924449
# Testing: 30.8%
nrow(test)/ nrow(hour.data)
## [1] 0.3075551
GOAL: Apply different models to predict the hourly total rental count.
Technical explanation: each model is fit on the training dataset and its predictions are compared against the actual values in the testing dataset. The metric is mean((y - yhat)^2), i.e. Mean Squared Error (MSE); we apply multiple models and look for the one that minimizes it. The total rental count (cnt) is the sum of rentals by registered users (registered) and casual users (casual). After trying different combinations, we found that two separate models, one predicting registered users and one predicting casual users, summed afterwards, yield a better result than a single model predicting cnt directly.
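The evaluation metric can be captured in a one-line helper. This is just a sketch; the report computes it inline as sum((y - yhat)^2)/nrow(test), which is equivalent:

```r
# Mean Squared Error: average of the squared prediction errors.
mse <- function(y, yhat) mean((y - yhat)^2)

# Toy check with three actual/predicted pairs:
mse(c(16, 40, 32), c(20, 35, 30))  # (16 + 25 + 4) / 3 = 15
```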
The models we applied are listed below. For each, the reported results are "cnt.MSE" (single model), "combined.MSE" (sum of the separate predictions), "registered.MSE", and "casual.MSE".
# Load packages
library(nnet)
library(ggplot2)
library(ggthemes)
library(gbm)
library(randomForest)
library(e1071)
library(rpart)
First, get the data ready. For the neural networks we leave the variables numeric; some attributes will be factorized later for the other models.
Step 1: Original model
# Original model
neural.formula = cnt ~ season + yr + mnth + hr + holiday + weekday + workingday + weathersit + temp + atemp + hum + windspeed
neural.model = nnet(neural.formula, train, size=20, maxit=5000, linout=T, decay=0.01)
## # weights: 281
## initial value 838926398.354622
## final value 394926965.983082
## converged
testset=subset(test, select = c("season", "yr" , "mnth" , "hr" , "holiday" , "weekday" , "workingday" , "weathersit" , "temp" , "atemp" , "hum" , "windspeed"))
neural.cnt = predict(neural.model, testset, type="raw")
test$neural.cnt=neural.cnt
# Compute MSE
neural.MSE = sum((test$cnt - test$neural.cnt)^2)/nrow(test)
neural.MSE
## [1] 33104.73
# Plot to check result
neural.result = ggplot(test,aes(cnt,neural.cnt))+geom_point()
neural.result
# Clamp negative predictions to zero (counts cannot be negative)
test$neural.cnt[test$neural.cnt < 0] = 0
# Compute new MSE
neural.MSE = sum((test$cnt - test$neural.cnt)^2)/nrow(test)
neural.MSE
## [1] 33104.73
neural.result = ggplot(test,aes(cnt,neural.cnt))+geom_point()
neural.result
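The clamping step above can be written as a small reusable helper using R's vectorized pmax (a sketch; `clamp_nonneg` is a hypothetical name, not used elsewhere in this report):

```r
# Counts cannot be negative, so floor predictions at zero.
clamp_nonneg <- function(x) pmax(x, 0)

clamp_nonneg(c(-3.2, 0, 12.5))  # -> 0.0 0.0 12.5
```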
Step 2: Separate models for registered and casual
# Model for Registered
# Removing windspeed and holiday yields the best result
neural.registered.formula = registered ~ season + yr + mnth + hr + weekday + workingday + weathersit + temp + atemp + hum
neural.model.registered = nnet(neural.registered.formula, train, size=20, maxit=5000, linout=T, decay=0.01)
## # weights: 241
## initial value 559079226.792752
## iter 10 value 244122412.256148
## iter 100 value 131076044.824574
## iter 500 value 110062291.187289
## iter1000 value 51710030.019903
## iter2000 value 24433932.359369
## ... (intermediate iterations omitted) ...
## iter2600 value 23426502.815646
## final value 23426257.224004
## converged
neural.registered = predict(neural.model.registered, testset, type="raw")
test$neural.registered = neural.registered
neural.registered.MSE = sum((test$neural.registered-test$registered)^2)/nrow(test)
neural.registered.MSE
## [1] 3450.674
neural.registered.result = ggplot(test,aes(registered,neural.registered))+geom_point()
neural.registered.result
test$neural.registered[test$neural.registered < 0] = 0
neural.registered.MSE = sum((test$neural.registered-test$registered)^2)/nrow(test)
neural.registered.MSE
## [1] 3407.284
neural.registered.result = ggplot(test,aes(registered,neural.registered))+geom_point()
neural.registered.result
# Model for Casual
neural.casual.formula = casual ~ season + yr + mnth + hr + holiday + weekday + workingday + weathersit + temp + atemp + hum + windspeed
neural.model.casual = nnet(neural.casual.formula, train, size=20, maxit=5000, linout=T, decay=0.01)
## # weights: 281
## initial value 47551479.555417
## iter 10 value 25785482.540706
## iter 100 value 9421502.681934
## iter 500 value 6725935.445146
## iter1000 value 5334148.967456
## iter2000 value 3564388.395040
## iter3000 value 3194999.428792
## ... (intermediate iterations omitted) ...
## iter3420 value 3069642.348061
## final value 3069642.348061
## converged
neural.casual = predict(neural.model.casual, testset, type="raw")
test$neural.casual = neural.casual
neural.casual.MSE = sum((test$neural.casual-test$casual)^2)/nrow(test)
neural.casual.MSE
## [1] 426.777
neural.casual.result = ggplot(test,aes(casual,neural.casual))+geom_point()
neural.casual.result
test$neural.casual[test$neural.casual < 0] = 0
neural.casual.MSE = sum((test$neural.casual-test$casual)^2)/nrow(test)
neural.casual.MSE
## [1] 412.3482
neural.casual.result = ggplot(test,aes(casual,neural.casual))+geom_point()
neural.casual.result
Step 3: Combine the predicted registered and casual counts
test$neural.combined = test$neural.casual + test$neural.registered
neural.combined.MSE = sum((test$cnt-test$neural.combined)^2)/nrow(test)
neural.combined.MSE
## [1] 4225.822
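Gathering the neural-network MSEs computed above into one small table makes the comparison explicit (the values are the ones reported in this section; `nn_results` is a hypothetical name used only for this summary):

```r
# Neural-network results from the runs above (after clamping negatives).
nn_results <- data.frame(
  model = c("single (cnt)", "registered only", "casual only", "combined"),
  MSE   = c(33104.73, 3407.284, 412.3482, 4225.822)
)
nn_results
# The combined model's MSE (4225.8) is roughly 8x lower than the
# single cnt model's (33104.7).
```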
Factorize the categorical attributes for the remaining models.
# Factorization of training data
train$season <- factor(train$season)
train$yr <- factor(train$yr)
train$mnth <- factor(train$mnth)
train$hr <- factor(train$hr)
train$holiday <- factor(train$holiday)
train$weekday<- factor(train$weekday)
train$workingday <- factor(train$workingday)
train$weathersit <- factor(train$weathersit)
# Factorization of test data
test$season <- factor(test$season)
test$yr <- factor(test$yr)
test$mnth <- factor(test$mnth)
test$hr <- factor(test$hr)
test$holiday <- factor(test$holiday)
test$weekday<- factor(test$weekday)
test$workingday <- factor(test$workingday)
test$weathersit <- factor(test$weathersit)
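The repeated factor() calls can be written more compactly with lapply. A sketch, equivalent to the assignments above (`factorize` is a hypothetical helper):

```r
# Convert the listed categorical columns of a data frame to factors.
cat_cols <- c("season", "yr", "mnth", "hr", "holiday",
              "weekday", "workingday", "weathersit")
factorize <- function(df, cols) {
  df[cols] <- lapply(df[cols], factor)
  df
}
# Usage: train <- factorize(train, cat_cols); test <- factorize(test, cat_cols)
```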
Step 1: Original model
# Original model
lm.formula = cnt ~ season + yr + mnth + hr + holiday + weekday +
workingday + weathersit + temp + atemp + hum + windspeed
fit.lm = lm(lm.formula,data=train)
summary(fit.lm)
##
## Call:
## lm(formula = lm.formula, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -357.98 -60.35 -7.81 50.80 436.42
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.367 7.970 -10.962 < 2e-16 ***
## season2 30.622 15.167 2.019 0.043509 *
## season3 3.532 21.332 0.166 0.868512
## season4 37.104 14.999 2.474 0.013383 *
## yr1 87.418 1.859 47.022 < 2e-16 ***
## mnth2 11.189 4.568 2.449 0.014336 *
## mnth3 29.450 4.939 5.963 2.55e-09 ***
## mnth4 24.124 15.883 1.519 0.128824
## mnth5 47.844 15.993 2.991 0.002782 **
## mnth6 35.138 16.281 2.158 0.030926 *
## mnth7 35.805 22.074 1.622 0.104820
## mnth8 49.474 22.012 2.248 0.024623 *
## mnth9 77.941 21.826 3.571 0.000357 ***
## mnth10 60.024 16.067 3.736 0.000188 ***
## mnth11 42.706 15.780 2.706 0.006812 **
## mnth12 38.384 15.071 2.547 0.010883 *
## hr1 -17.267 6.337 -2.725 0.006441 **
## hr2 -27.562 6.357 -4.336 1.47e-05 ***
## hr3 -38.581 6.416 -6.013 1.88e-09 ***
## hr4 -39.780 6.396 -6.219 5.16e-10 ***
## hr5 -24.003 6.364 -3.772 0.000163 ***
## hr6 35.766 6.353 5.630 1.84e-08 ***
## hr7 170.904 6.344 26.938 < 2e-16 ***
## hr8 314.405 6.335 49.630 < 2e-16 ***
## hr9 166.689 6.341 26.287 < 2e-16 ***
## hr10 110.435 6.366 17.348 < 2e-16 ***
## hr11 137.322 6.412 21.417 < 2e-16 ***
## hr12 177.658 6.463 27.490 < 2e-16 ***
## hr13 173.189 6.516 26.580 < 2e-16 ***
## hr14 156.575 6.558 23.876 < 2e-16 ***
## hr15 167.026 6.569 25.426 < 2e-16 ***
## hr16 230.234 6.554 35.127 < 2e-16 ***
## hr17 387.221 6.517 59.413 < 2e-16 ***
## hr18 352.712 6.472 54.495 < 2e-16 ***
## hr19 240.526 6.408 37.538 < 2e-16 ***
## hr20 158.888 6.373 24.931 < 2e-16 ***
## hr21 108.764 6.346 17.138 < 2e-16 ***
## hr22 72.833 6.335 11.497 < 2e-16 ***
## hr23 33.563 6.330 5.302 1.17e-07 ***
## holiday1 -5.153 5.805 -0.888 0.374687
## weekday1 4.933 3.568 1.382 0.166908
## weekday2 10.852 3.447 3.149 0.001644 **
## weekday3 10.816 3.448 3.137 0.001711 **
## weekday4 11.964 3.441 3.477 0.000510 ***
## weekday5 18.493 3.448 5.364 8.30e-08 ***
## weekday6 20.338 3.420 5.946 2.82e-09 ***
## workingday1 NA NA NA NA
## weathersit2 -11.980 2.258 -5.306 1.14e-07 ***
## weathersit3 -65.555 3.806 -17.223 < 2e-16 ***
## weathersit4 -51.508 71.197 -0.723 0.469414
## temp 108.484 33.430 3.245 0.001177 **
## atemp 110.132 34.130 3.227 0.001255 **
## hum -80.389 6.680 -12.035 < 2e-16 ***
## windspeed -33.401 8.380 -3.986 6.77e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 100.4 on 11981 degrees of freedom
## Multiple R-squared: 0.6942, Adjusted R-squared: 0.6929
## F-statistic: 523.1 on 52 and 11981 DF, p-value: < 2.2e-16
lm.cnt=predict(fit.lm, test)
test$lm.cnt = lm.cnt
lm.MSE = sum((test$cnt - test$lm.cnt)^2)/nrow(test)
lm.MSE
## [1] 11297.42
lm.result = ggplot(test,aes(cnt,lm.cnt))+geom_point()
lm.result
test$lm.cnt[test$lm.cnt < 0] = 0
lm.MSE = sum((test$cnt - test$lm.cnt)^2)/nrow(test)
lm.MSE
## [1] 10738.16
lm.result = ggplot(test,aes(cnt,lm.cnt))+geom_point()
lm.result
Step2: Separate models for registered and casual
# Model for Registered
# Windspeed and weekday are taken out
lm.registered.formula = registered ~ season + yr + mnth + hr + holiday +
workingday + weathersit + temp + atemp + hum
lm.model.registered = lm(lm.registered.formula,data=train)
lm.registered = predict(lm.model.registered,test)
test$lm.registered = lm.registered
lm.registered.MSE = sum((test$lm.registered-test$registered)^2)/nrow(test)
lm.registered.MSE
## [1] 8157.388
lm.registered.result = ggplot(test,aes(registered,lm.registered))+geom_point()
lm.registered.result
test$lm.registered[test$lm.registered < 0] = 0
lm.registered.MSE = sum((test$lm.registered-test$registered)^2)/nrow(test)
lm.registered.MSE
## [1] 7763.378
lm.registered.result = ggplot(test,aes(registered,lm.registered))+geom_point()
lm.registered.result
# Model for Casual
# Windspeed is taken out
lm.casual.formula = casual ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum
lm.model.casual = lm(lm.casual.formula, data = train)
lm.casual = predict(lm.model.casual,test)
test$lm.casual = lm.casual
lm.casual.MSE = sum((test$lm.casual-test$casual)^2)/nrow(test)
lm.casual.MSE
## [1] 998.3204
lm.casual.result = ggplot(test,aes(casual,lm.casual))+geom_point()
lm.casual.result
test$lm.casual[test$lm.casual < 0] = 0
lm.casual.MSE = sum((test$lm.casual-test$casual)^2)/nrow(test)
lm.casual.MSE
## [1] 913.7805
lm.casual.result = ggplot(test,aes(casual,lm.casual))+geom_point()
lm.casual.result
Step3: Now combine the predicted registered users and casual users
test$lm.combined = test$lm.casual + test$lm.registered
lm.combined.MSE = sum((test$cnt-test$lm.combined)^2)/nrow(test)
lm.combined.MSE
## [1] 10684.49
Step1: Original model
# Original Model
svm.formula = cnt ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
svm.model = svm(svm.formula, data = train)
svm.cnt=predict(svm.model, test)
test$svm.cnt = svm.cnt
svm.MSE = sum((test$cnt - test$svm.cnt)^2)/nrow(test)
svm.MSE
## [1] 7546.31
svm.result = ggplot(test,aes(cnt,svm.cnt))+geom_point()
svm.result
test$svm.cnt[test$svm.cnt < 0] = 0
svm.MSE = sum((test$cnt - test$svm.cnt)^2)/nrow(test)
svm.MSE
## [1] 7511.831
svm.result = ggplot(test,aes(cnt,svm.cnt))+geom_point()
svm.result
Step2: Separate models for registered and casual
# Model for Registered
svm.registered.formula = registered ~ season + yr + mnth + hr + weekday + workingday + temp + atemp + hum
svm.model.registered = svm(svm.registered.formula, data = train)
svm.registered = predict(svm.model.registered,test)
test$svm.registered = svm.registered
svm.registered.MSE = sum((test$svm.registered-test$registered)^2)/nrow(test)
svm.registered.MSE
## [1] 5870.63
svm.registered.result = ggplot(test,aes(registered,svm.registered))+geom_point()
svm.registered.result
test$svm.registered[test$svm.registered < 0] = 0
svm.registered.MSE = sum((test$svm.registered-test$registered)^2)/nrow(test)
svm.registered.MSE
## [1] 5854.001
svm.registered.result = ggplot(test,aes(registered,svm.registered))+geom_point()
svm.registered.result
# Model for Casual
svm.casual.formula = casual ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
svm.model.casual = svm(svm.casual.formula, data = train)
svm.casual = predict(svm.model.casual,test)
test$svm.casual = svm.casual
svm.casual.MSE = sum((test$svm.casual-test$casual)^2)/nrow(test)
svm.casual.MSE
## [1] 650.8312
svm.casual.result = ggplot(test,aes(casual,svm.casual))+geom_point()
svm.casual.result
test$svm.casual[test$svm.casual < 0] = 0
svm.casual.MSE = sum((test$svm.casual-test$casual)^2)/nrow(test)
svm.casual.MSE
## [1] 647.9083
svm.casual.result = ggplot(test,aes(casual,svm.casual))+geom_point()
svm.casual.result
Step3: Now combine the predicted registered users and casual users
test$svm.combined = test$svm.casual + test$svm.registered
svm.combined.MSE = sum((test$cnt-test$svm.combined)^2)/nrow(test)
svm.combined.MSE
## [1] 7655.972
Step1: Original model
# Original Model
forest.formula = cnt ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
forest.model = randomForest(forest.formula, data = train, importance = TRUE, ntree = 200)
forest.model$importance
## %IncMSE IncNodePurity
## season 1871.34242 8635705.9
## yr 6088.63993 29990562.8
## mnth 3312.97212 19240883.7
## hr 37931.20654 207515805.3
## holiday 84.49893 821640.8
## weekday 3881.31674 17465827.4
## workingday 4331.12865 12914924.0
## weathersit 901.14050 6197386.3
## temp 4655.11282 25487693.4
## atemp 4356.66207 29818990.7
## hum 3598.02802 24836117.6
## windspeed 354.96422 6505362.1
forest.cnt = predict(forest.model,test)
test$forest.cnt = forest.cnt
forest.MSE = sum((test$cnt - test$forest.cnt)^2)/nrow(test)
forest.MSE
## [1] 3880.79
forest.result = ggplot(test,aes(cnt,forest.cnt))+geom_point()
forest.result
test$forest.cnt[test$forest.cnt < 0] = 0
forest.MSE = sum((test$cnt - test$forest.cnt)^2)/nrow(test)
forest.MSE
## [1] 3880.79
forest.result = ggplot(test,aes(cnt,forest.cnt))+geom_point()
forest.result
Step2: Separate models for registered and casual
# Model for Registered
forest.registered.formula = registered ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
forest.model.registered = randomForest(forest.registered.formula, data = train, importance = TRUE, ntree = 200)
forest.model.registered$importance
## %IncMSE IncNodePurity
## season 1478.23494 6764863.7
## yr 4685.71861 22117971.9
## mnth 2069.56038 11215853.4
## hr 29707.85995 155209963.3
## holiday 74.91226 616967.2
## weekday 3304.74832 14854136.8
## workingday 3934.06460 12417113.8
## weathersit 623.19101 4380534.2
## temp 2196.23971 11315541.4
## atemp 2270.60299 13667475.3
## hum 1837.34149 12978243.7
## windspeed 226.53556 4097693.1
forest.registered = predict(forest.model.registered,test)
test$forest.registered = forest.registered
forest.registered.MSE = sum((test$forest.registered-test$registered)^2)/nrow(test)
forest.registered.MSE
## [1] 2621.8
forest.registered.result = ggplot(test,aes(registered,forest.registered))+geom_point()
forest.registered.result
test$forest.registered[test$forest.registered < 0] = 0
forest.registered.MSE = sum((test$forest.registered-test$registered)^2)/nrow(test)
forest.registered.MSE
## [1] 2621.8
forest.registered.result = ggplot(test,aes(registered,forest.registered))+geom_point()
forest.registered.result
# Model for Casual
forest.casual.formula = casual ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
forest.model.casual = randomForest(forest.casual.formula, data = train, importance = TRUE, ntree = 160)
forest.model.casual$importance
## %IncMSE IncNodePurity
## season 182.16943 623588.3
## yr 236.50493 986376.2
## mnth 320.39413 1617992.7
## hr 1877.97551 9801297.7
## holiday 25.03605 170788.4
## weekday 768.45470 3078717.6
## workingday 813.36772 3066046.2
## weathersit 49.35926 322439.7
## temp 637.40638 3412538.3
## atemp 612.23397 3852357.0
## hum 293.67728 2120674.9
## windspeed 33.82769 514062.8
forest.casual = predict(forest.model.casual,test)
test$forest.casual = forest.casual
forest.casual.MSE = sum((test$forest.casual-test$casual)^2)/nrow(test)
forest.casual.MSE
## [1] 384.6417
forest.casual.result = ggplot(test,aes(casual,forest.casual))+geom_point()
forest.casual.result
test$forest.casual[test$forest.casual < 0] = 0
forest.casual.MSE = sum((test$forest.casual-test$casual)^2)/nrow(test)
forest.casual.MSE
## [1] 384.6417
forest.casual.result = ggplot(test,aes(casual,forest.casual))+geom_point()
forest.casual.result
Step3: Now combine the predicted registered users and casual users
test$forest.combined = test$forest.casual + test$forest.registered
forest.combined.MSE = sum((test$cnt-test$forest.combined)^2)/nrow(test)
forest.combined.MSE
## [1] 3525.916
Step1: Orignal model
# Original Model
gbm.formula = cnt ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
gbm.model = gbm(gbm.formula, data=train, n.trees=1000, distribution="gaussian", interaction.depth=5, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)
summary(gbm.model)
## var rel.inf
## hr hr 60.2734626
## yr yr 9.2049903
## workingday workingday 7.2880830
## temp temp 6.2409444
## mnth mnth 4.9286799
## weekday weekday 3.6216000
## atemp atemp 2.6415323
## season season 1.9275302
## weathersit weathersit 1.8871193
## hum hum 1.5486118
## windspeed windspeed 0.3005041
## holiday holiday 0.1369422
perf.trees = gbm.perf(gbm.model)
## Using OOB method...
gbm.cnt = predict(gbm.model, newdata=test, n.trees=perf.trees)
test$gbm.cnt = gbm.cnt
gbm.MSE = sum((test$cnt - test$gbm.cnt)^2)/nrow(test)
gbm.MSE
## [1] 3518.01
gbm.result = ggplot(test,aes(cnt,gbm.cnt))+geom_point()
gbm.result
test$gbm.cnt[test$gbm.cnt < 0] = 0
gbm.MSE = sum((test$cnt - test$gbm.cnt)^2)/nrow(test)
gbm.MSE
## [1] 3506.468
gbm.result = ggplot(test,aes(cnt,gbm.cnt))+geom_point()
gbm.result
Step2: Separate models for registered and casual
# Model for Registered
# Removing holiday, windspeed and hum gives the best result
gbm.registered.formula = registered ~ season + yr + mnth + hr + weekday + workingday+ weathersit + temp
gbm.model.registered = gbm(gbm.registered.formula, data=train, n.trees=1000, distribution="gaussian", interaction.depth=5, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)
summary(gbm.model.registered)
## var rel.inf
## hr hr 59.236631
## workingday workingday 11.068316
## yr yr 10.799411
## mnth mnth 6.255357
## weekday weekday 4.714831
## temp temp 3.322063
## weathersit weathersit 2.817562
## season season 1.785829
perf.trees.registered = gbm.perf(gbm.model.registered)
## Using OOB method...
gbm.registered = predict(gbm.model.registered,newdata=test,n.trees = perf.trees.registered)
test$gbm.registered = gbm.registered
gbm.registered.MSE = sum((test$gbm.registered-test$registered)^2)/nrow(test)
gbm.registered.MSE
## [1] 2565.314
gbm.registered.result = ggplot(test,aes(registered,gbm.registered))+geom_point()
gbm.registered.result
test$gbm.registered[test$gbm.registered < 0] = 0
gbm.registered.MSE = sum((test$gbm.registered-test$registered)^2)/nrow(test)
gbm.registered.MSE
## [1] 2561.061
gbm.registered.result = ggplot(test,aes(registered,gbm.registered))+geom_point()
gbm.registered.result
# Model for Casual
gbm.casual.formula = casual ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
gbm.model.casual = gbm(gbm.casual.formula, data=train, n.trees=1000, distribution="gaussian", interaction.depth=5, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)
summary(gbm.model.casual)
## var rel.inf
## hr hr 34.8206199
## temp temp 14.7735903
## workingday workingday 13.2366269
## weekday weekday 11.1743710
## atemp atemp 8.5829810
## mnth mnth 7.2801053
## yr yr 4.2087383
## hum hum 3.8232396
## windspeed windspeed 0.8347149
## weathersit weathersit 0.7122874
## holiday holiday 0.3983422
## season season 0.1543832
perf.trees.casual = gbm.perf(gbm.model.casual)
## Using OOB method...
gbm.casual = predict(gbm.model.casual,newdata=test,n.trees = perf.trees.casual)
test$gbm.casual = gbm.casual
gbm.casual.MSE = sum((test$gbm.casual-test$casual)^2)/nrow(test)
gbm.casual.MSE
## [1] 355.4819
gbm.casual.result = ggplot(test,aes(casual,gbm.casual))+geom_point()
gbm.casual.result
test$gbm.casual[test$gbm.casual < 0] = 0
gbm.casual.MSE = sum((test$gbm.casual-test$casual)^2)/nrow(test)
gbm.casual.MSE
## [1] 354.4988
gbm.casual.result = ggplot(test,aes(casual,gbm.casual))+geom_point()
gbm.casual.result
Step3: Now combine the predicted registered users and casual users
test$gbm.combined = test$gbm.casual + test$gbm.registered
gbm.combined.MSE = sum((test$cnt-test$gbm.combined)^2)/nrow(test)
gbm.combined.MSE
## [1] 3349.674
Step1: Original model
# Original Model
formula.cnt <- cnt ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
fit.rpart.cnt <- rpart(formula.cnt, method="anova", data= train)
print(summary(fit.rpart.cnt))
## Call:
## rpart(formula = formula.cnt, data = train, method = "anova")
## n= 12034
##
## CP nsplit rel error xerror xstd
## 1 0.36074746 0 1.0000000 1.0000552 0.016575310
## 2 0.10238796 1 0.6392525 0.6393086 0.012201926
## 3 0.05755831 2 0.5368646 0.5403609 0.009621364
## 4 0.03971267 3 0.4793063 0.4725795 0.008129473
## 5 0.03050477 4 0.4395936 0.4311524 0.007441543
## 6 0.02638756 5 0.4090888 0.4042151 0.006714492
## 7 0.02181266 6 0.3827013 0.3780289 0.006383340
## 8 0.02006172 7 0.3608886 0.3598845 0.006252274
## 9 0.01945226 8 0.3408269 0.3349020 0.005815467
## 10 0.01611249 10 0.3019223 0.3109133 0.005343486
## 11 0.01412510 11 0.2858099 0.2863092 0.005100399
## 12 0.01332590 12 0.2716848 0.2691068 0.005087445
## 13 0.01123726 13 0.2583589 0.2600968 0.004951488
## 14 0.01000000 14 0.2471216 0.2517729 0.004816167
##
## Variable importance
## hr yr mnth temp atemp hum
## 51 9 7 7 7 5
## season workingday weekday windspeed
## 5 4 4 1
##
## Node number 1: 12034 observations, complexity param=0.3607475
## mean=191.5313, MSE=32814.64
## left son=2 (4479 obs) right son=3 (7555 obs)
## Primary splits:
## hr splits as LLLLLLLRRRRRRRRRRRRRRRLL, improve=0.36074750, (0 missing)
## atemp < 0.5985 to the left, improve=0.12935650, (0 missing)
## temp < 0.55 to the left, improve=0.10419890, (0 missing)
## hum < 0.665 to the right, improve=0.07209467, (0 missing)
## yr splits as LR, improve=0.06667235, (0 missing)
## Surrogate splits:
## hum < 0.765 to the right, agree=0.663, adj=0.094, (0 split)
## windspeed < 0.09705 to the left, agree=0.632, adj=0.013, (0 split)
## temp < 0.15 to the left, agree=0.630, adj=0.007, (0 split)
## atemp < 0.08335 to the left, agree=0.629, adj=0.003, (0 split)
##
## Node number 2: 4479 observations, complexity param=0.0141251
## mean=50.22483, MSE=3239.629
## left son=4 (2968 obs) right son=5 (1511 obs)
## Primary splits:
## hr splits as LLLLLLR---------------RR, improve=0.38440840, (0 missing)
## temp < 0.47 to the left, improve=0.06934414, (0 missing)
## atemp < 0.4621 to the left, improve=0.06834863, (0 missing)
## mnth splits as LLLLRRRRRRLL, improve=0.05825211, (0 missing)
## season splits as LRRR, improve=0.05093041, (0 missing)
## Surrogate splits:
## temp < 0.77 to the left, agree=0.666, adj=0.009, (0 split)
## atemp < 0.73485 to the left, agree=0.665, adj=0.008, (0 split)
## hum < 0.11 to the right, agree=0.663, adj=0.001, (0 split)
##
## Node number 3: 7555 observations, complexity param=0.102388
## mean=275.3052, MSE=31492.39
## left son=6 (5036 obs) right son=7 (2519 obs)
## Primary splits:
## hr splits as -------LRLLLLLLLRRRRLL--, improve=0.1699364, (0 missing)
## yr splits as LR, improve=0.1505464, (0 missing)
## temp < 0.49 to the left, improve=0.1391370, (0 missing)
## atemp < 0.47725 to the left, improve=0.1364355, (0 missing)
## season splits as LRRR, improve=0.1233174, (0 missing)
## Surrogate splits:
## windspeed < 0.7985 to the left, agree=0.667, adj=0.001, (0 split)
## weathersit splits as LLLR, agree=0.667, adj=0.000, (0 split)
## atemp < 0.9015 to the left, agree=0.667, adj=0.000, (0 split)
##
## Node number 4: 2968 observations
## mean=25.04549, MSE=933.8123
##
## Node number 5: 1511 observations
## mean=99.68365, MSE=4077.341
##
## Node number 6: 5036 observations, complexity param=0.03971267
## mean=223.5663, MSE=17665.7
## left son=12 (2516 obs) right son=13 (2520 obs)
## Primary splits:
## yr splits as LR, improve=0.1762747, (0 missing)
## temp < 0.45 to the left, improve=0.1504354, (0 missing)
## atemp < 0.44695 to the left, improve=0.1476129, (0 missing)
## season splits as LRRR, improve=0.1474771, (0 missing)
## mnth splits as LLLRRRRRRRRR, improve=0.1461857, (0 missing)
## Surrogate splits:
## hum < 0.505 to the right, agree=0.535, adj=0.070, (0 split)
## temp < 0.55 to the left, agree=0.529, adj=0.057, (0 split)
## atemp < 0.52275 to the left, agree=0.528, adj=0.056, (0 split)
## windspeed < 0.26865 to the right, agree=0.527, adj=0.053, (0 split)
## weathersit splits as RLL-, agree=0.512, adj=0.024, (0 split)
##
## Node number 7: 2519 observations, complexity param=0.05755831
## mean=378.742, MSE=43083.91
## left son=14 (1259 obs) right son=15 (1260 obs)
## Primary splits:
## yr splits as LR, improve=0.2094317, (0 missing)
## temp < 0.49 to the left, improve=0.1993776, (0 missing)
## atemp < 0.47725 to the left, improve=0.1949973, (0 missing)
## season splits as LRRR, improve=0.1673958, (0 missing)
## mnth splits as LLLRRRRRRRRR, improve=0.1626678, (0 missing)
## Surrogate splits:
## hum < 0.445 to the right, agree=0.554, adj=0.107, (0 split)
## atemp < 0.58335 to the left, agree=0.533, adj=0.066, (0 split)
## temp < 0.59 to the left, agree=0.532, adj=0.064, (0 split)
## windspeed < 0.31345 to the right, agree=0.515, adj=0.030, (0 split)
## weathersit splits as RRLR, agree=0.509, adj=0.018, (0 split)
##
## Node number 12: 2516 observations, complexity param=0.01611249
## mean=167.7186, MSE=9589.757
## left son=24 (836 obs) right son=25 (1680 obs)
## Primary splits:
## mnth splits as LLLLRRRRRRRR, improve=0.2637072, (0 missing)
## season splits as LRRR, improve=0.2394119, (0 missing)
## atemp < 0.4621 to the left, improve=0.2192312, (0 missing)
## temp < 0.47 to the left, improve=0.2192312, (0 missing)
## workingday splits as RL, improve=0.0625818, (0 missing)
## Surrogate splits:
## season splits as LRRR, agree=0.909, adj=0.725, (0 split)
## atemp < 0.35605 to the left, agree=0.793, adj=0.378, (0 split)
## temp < 0.39 to the left, agree=0.791, adj=0.372, (0 split)
## hum < 0.405 to the left, agree=0.688, adj=0.060, (0 split)
## windspeed < 0.37315 to the right, agree=0.687, adj=0.057, (0 split)
##
## Node number 13: 2520 observations, complexity param=0.02006172
## mean=279.3254, MSE=19505.74
## left son=26 (721 obs) right son=27 (1799 obs)
## Primary splits:
## temp < 0.39 to the left, improve=0.1611695, (0 missing)
## atemp < 0.4015 to the left, improve=0.1571376, (0 missing)
## mnth splits as LLRRRRRRRRRR, improve=0.1559454, (0 missing)
## season splits as LRRR, improve=0.1512278, (0 missing)
## weekday splits as RLLLLLR, improve=0.0617536, (0 missing)
## Surrogate splits:
## atemp < 0.4015 to the left, agree=0.996, adj=0.986, (0 split)
## mnth splits as LLRRRRRRRRLL, agree=0.873, adj=0.558, (0 split)
## season splits as LRRR, agree=0.803, adj=0.311, (0 split)
## hum < 0.91 to the right, agree=0.715, adj=0.004, (0 split)
##
## Node number 14: 1259 observations, complexity param=0.02181266
## mean=283.7141, MSE=22177.25
## left son=28 (524 obs) right son=29 (735 obs)
## Primary splits:
## mnth splits as LLLLRRRRRRRL, improve=0.3084984, (0 missing)
## season splits as LRRR, improve=0.3003280, (0 missing)
## temp < 0.47 to the left, improve=0.2964665, (0 missing)
## atemp < 0.4621 to the left, improve=0.2964665, (0 missing)
## workingday splits as LR, improve=0.1184052, (0 missing)
## Surrogate splits:
## temp < 0.47 to the left, agree=0.841, adj=0.618, (0 split)
## atemp < 0.4621 to the left, agree=0.841, adj=0.618, (0 split)
## season splits as LRRR, agree=0.833, adj=0.599, (0 split)
## hum < 0.385 to the left, agree=0.626, adj=0.101, (0 split)
## windspeed < 0.31345 to the right, agree=0.622, adj=0.092, (0 split)
##
## Node number 15: 1260 observations, complexity param=0.03050477
## mean=473.6944, MSE=45934.88
## left son=30 (400 obs) right son=31 (860 obs)
## Primary splits:
## workingday splits as LR, improve=0.2081288, (0 missing)
## temp < 0.43 to the left, improve=0.1949022, (0 missing)
## atemp < 0.4318 to the left, improve=0.1887979, (0 missing)
## mnth splits as LLRRRRRRRRRR, improve=0.1842011, (0 missing)
## weekday splits as LRRRRRL, improve=0.1820905, (0 missing)
## Surrogate splits:
## weekday splits as LRRRRRL, agree=0.968, adj=0.900, (0 split)
## holiday splits as RL, agree=0.714, adj=0.100, (0 split)
## atemp < 0.1894 to the left, agree=0.689, adj=0.020, (0 split)
## hum < 0.195 to the left, agree=0.686, adj=0.010, (0 split)
## temp < 0.93 to the right, agree=0.685, adj=0.007, (0 split)
##
## Node number 24: 836 observations
## mean=96.43062, MSE=3492.994
##
## Node number 25: 1680 observations
## mean=203.1929, MSE=8836.312
##
## Node number 26: 721 observations
## mean=190.7587, MSE=9708.069
##
## Node number 27: 1799 observations, complexity param=0.01945226
## mean=314.821, MSE=19028.77
## left son=54 (1240 obs) right son=55 (559 obs)
## Primary splits:
## workingday splits as RL, improve=0.13644570, (0 missing)
## weekday splits as RLLLLLR, improve=0.13176200, (0 missing)
## hr splits as -------R-RLRRRRR----RL--, improve=0.05825584, (0 missing)
## weathersit splits as RRL-, improve=0.05455847, (0 missing)
## atemp < 0.61365 to the left, improve=0.05099518, (0 missing)
## Surrogate splits:
## weekday splits as RLLLLLR, agree=0.974, adj=0.916, (0 split)
## holiday splits as LR, agree=0.715, adj=0.084, (0 split)
## atemp < 0.85605 to the left, agree=0.694, adj=0.016, (0 split)
## temp < 0.95 to the left, agree=0.692, adj=0.009, (0 split)
## hum < 0.205 to the right, agree=0.690, adj=0.004, (0 split)
##
## Node number 28: 524 observations
## mean=185.7519, MSE=11624.71
##
## Node number 29: 735 observations
## mean=353.5537, MSE=17981.19
##
## Node number 30: 400 observations, complexity param=0.01123726
## mean=330.325, MSE=32555.74
## left son=60 (164 obs) right son=61 (236 obs)
## Primary splits:
## temp < 0.49 to the left, improve=0.3407615, (0 missing)
## atemp < 0.47725 to the left, improve=0.3407615, (0 missing)
## hr splits as --------L-------RRRL----, improve=0.2724685, (0 missing)
## mnth splits as LLRRRRRRRRRL, improve=0.2545665, (0 missing)
## season splits as LRRR, improve=0.1791564, (0 missing)
## Surrogate splits:
## atemp < 0.47725 to the left, agree=1.000, adj=1.000, (0 split)
## season splits as LRRL, agree=0.878, adj=0.701, (0 split)
## mnth splits as LLRRRRRRRLLL, agree=0.878, adj=0.701, (0 split)
## hum < 0.745 to the right, agree=0.638, adj=0.116, (0 split)
## weathersit splits as RRL-, agree=0.612, adj=0.055, (0 split)
##
## Node number 31: 860 observations, complexity param=0.02638756
## mean=540.3779, MSE=38150.68
## left son=62 (344 obs) right son=63 (516 obs)
## Primary splits:
## hr splits as --------R-------LRRL----, improve=0.3175968, (0 missing)
## mnth splits as LLLRRRRRRRLL, improve=0.2233810, (0 missing)
## season splits as LRRR, improve=0.2183222, (0 missing)
## temp < 0.49 to the left, improve=0.1973805, (0 missing)
## atemp < 0.47725 to the left, improve=0.1920992, (0 missing)
## Surrogate splits:
## temp < 0.87 to the right, agree=0.602, adj=0.006, (0 split)
## atemp < 0.75 to the right, agree=0.602, adj=0.006, (0 split)
## hum < 0.215 to the left, agree=0.601, adj=0.003, (0 split)
## windspeed < 0.56715 to the right, agree=0.601, adj=0.003, (0 split)
##
## Node number 54: 1240 observations
## mean=280.6089, MSE=8820.401
##
## Node number 55: 559 observations, complexity param=0.01945226
## mean=390.712, MSE=33317.61
## left son=110 (204 obs) right son=111 (355 obs)
## Primary splits:
## hr splits as -------L-LRRRRRR----LL--, improve=0.57408930, (0 missing)
## hum < 0.575 to the right, improve=0.24107720, (0 missing)
## atemp < 0.61365 to the left, improve=0.10204650, (0 missing)
## temp < 0.55 to the left, improve=0.06584068, (0 missing)
## windspeed < 0.31345 to the left, improve=0.04022250, (0 missing)
## Surrogate splits:
## hum < 0.715 to the right, agree=0.696, adj=0.167, (0 split)
##
## Node number 60: 164 observations
## mean=203.9756, MSE=14611.74
##
## Node number 61: 236 observations
## mean=418.1271, MSE=26222.35
##
## Node number 62: 344 observations
## mean=405.564, MSE=17641.27
##
## Node number 63: 516 observations, complexity param=0.0133259
## mean=630.2539, MSE=31629.39
## left son=126 (213 obs) right son=127 (303 obs)
## Primary splits:
## mnth splits as LLLRRRRRRRLL, improve=0.3224286, (0 missing)
## temp < 0.49 to the left, improve=0.3217399, (0 missing)
## season splits as LRRR, improve=0.3196353, (0 missing)
## atemp < 0.47725 to the left, improve=0.3146083, (0 missing)
## weathersit splits as RRLL, improve=0.1475443, (0 missing)
## Surrogate splits:
## season splits as LRRL, agree=0.913, adj=0.789, (0 split)
## temp < 0.49 to the left, agree=0.888, adj=0.728, (0 split)
## atemp < 0.47725 to the left, agree=0.882, adj=0.714, (0 split)
## hum < 0.92 to the right, agree=0.597, adj=0.023, (0 split)
## windspeed < 0.43285 to the right, agree=0.597, adj=0.023, (0 split)
##
## Node number 110: 204 observations
## mean=208.2696, MSE=11069.36
##
## Node number 111: 355 observations
## mean=495.5521, MSE=15983.78
##
## Node number 126: 213 observations
## mean=509.8075, MSE=21483.6
##
## Node number 127: 303 observations
## mean=714.9241, MSE=21394.31
##
## n= 12034
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 12034 394891300 191.53130
## 2) hr=0,1,2,3,4,5,6,22,23 4479 14510300 50.22483
## 4) hr=0,1,2,3,4,5 2968 2771555 25.04549 *
## 5) hr=6,22,23 1511 6160863 99.68365 *
## 3) hr=7,8,9,10,11,12,13,14,15,16,17,18,19,20,21 7555 237925000 275.30520
## 6) hr=7,9,10,11,12,13,14,15,20,21 5036 88964490 223.56630
## 12) yr=0 2516 24127830 167.71860
## 24) mnth=1,2,3,4 836 2920143 96.43062 *
## 25) mnth=5,6,7,8,9,10,11,12 1680 14845000 203.19290 *
## 13) yr=1 2520 49154470 279.32540
## 26) temp< 0.39 721 6999518 190.75870 *
## 27) temp>=0.39 1799 34232750 314.82100
## 54) workingday=1 1240 10937300 280.60890 *
## 55) workingday=0 559 18624540 390.71200
## 110) hr=7,9,20,21 204 2258150 208.26960 *
## 111) hr=10,11,12,13,14,15 355 5674242 495.55210 *
## 7) hr=8,16,17,18,19 2519 108528400 378.74200
## 14) yr=0 1259 27921150 283.71410
## 28) mnth=1,2,3,4,12 524 6091350 185.75190 *
## 29) mnth=5,6,7,8,9,10,11 735 13216170 353.55370 *
## 15) yr=1 1260 57877950 473.69440
## 30) workingday=0 400 13022300 330.32500
## 60) temp< 0.49 164 2396326 203.97560 *
## 61) temp>=0.49 236 6188474 418.12710 *
## 31) workingday=1 860 32809580 540.37790
## 62) hr=16,19 344 6068599 405.56400 *
## 63) hr=8,17,18 516 16320770 630.25390
## 126) mnth=1,2,3,11,12 213 4576007 509.80750 *
## 127) mnth=4,5,6,7,8,9,10 303 6482477 714.92410 *
# Access significant variables
fit.rpart.cnt$variable.importance
## hr yr mnth temp atemp hum
## 209578416 38411469 27767349 27500369 27021271 20736191
## season workingday weekday windspeed holiday weathersit
## 19497245 16716978 15119647 4642626 1597331 1048780
# Validate the fit.rpart model using testing data
test.cnt <- test[, "cnt"]
test.x <- test[, 3:14]
rpart.cnt <- predict(fit.rpart.cnt, test.x)
test$rpart.cnt <- rpart.cnt
rpart.MSE = mean((rpart.cnt - test.cnt)^2)
rpart.MSE
## [1] 9523.749
rpart.result = ggplot(test,aes(cnt,rpart.cnt))+geom_point()
rpart.result
Step2: Separate models for registered and casual
# Model for Registered
formula.registered <- registered ~ season + yr + mnth + hr + holiday + weekday + workingday + weathersit + temp + atemp + hum + windspeed
fit.rpart.registered <- rpart(formula.registered, method="anova", data= train)
test.registered <- test[, "registered"]
rpart.registered <- predict(fit.rpart.registered, test.x)
test$rpart.registered <- rpart.registered
rpart.registered.MSE = mean((rpart.registered - test.registered)^2)
rpart.registered.MSE
## [1] 7096.629
rpart.registered.result = ggplot(test,aes(registered,rpart.registered))+geom_point()
rpart.registered.result
# Model for Casual
formula.casual <- casual ~ season + yr + mnth + hr + holiday + weekday + workingday + weathersit + temp + atemp + hum + windspeed
fit.rpart.casual <- rpart(formula.casual, method="anova", data= train)
test.casual <- test[, "casual"]
rpart.casual <- predict(fit.rpart.casual, test.x)
test$rpart.casual <- rpart.casual
rpart.casual.MSE = mean((rpart.casual - test.casual)^2)
rpart.casual.MSE
## [1] 661.3148
rpart.casual.result = ggplot(test,aes(casual,rpart.casual))+geom_point()
rpart.casual.result
Step3: Now combine the predicted registered users and casual users
rpart.combined = rpart.casual + rpart.registered
test$rpart.combined <- rpart.combined
rpart.combined.MSE = sum((test$cnt-test$rpart.combined)^2)/nrow(test)
rpart.combined.MSE
## [1] 8568.5
Now let’s compare the results of all the models in the comparison table below.
rpart.MSEs <- c(rpart.MSE, rpart.combined.MSE, rpart.registered.MSE, rpart.casual.MSE )
rpart.MSEs <- matrix(rpart.MSEs, nrow= 1, ncol=4)
colnames(rpart.MSEs) <-c("cnt.MSE", "combined.MSE", "registered.MSE", "casual.MSE" )
rownames(rpart.MSEs) <- "rpart.MSEs"
lm.MSEs = c(lm.MSE, lm.combined.MSE, lm.registered.MSE, lm.casual.MSE)
forest.MSEs = c(forest.MSE, forest.combined.MSE, forest.registered.MSE, forest.casual.MSE )
svm.MSEs = c(svm.MSE, svm.combined.MSE, svm.registered.MSE, svm.casual.MSE)
neural.MSEs = c(neural.MSE, neural.combined.MSE, neural.registered.MSE, neural.casual.MSE)
gbm.MSEs = c(gbm.MSE, gbm.combined.MSE, gbm.registered.MSE, gbm.casual.MSE)
Summary.MSE=rbind(rpart.MSEs, forest.MSEs, svm.MSEs, neural.MSEs, gbm.MSEs, lm.MSEs)
Summary.MSE
## cnt.MSE combined.MSE registered.MSE casual.MSE
## rpart.MSEs 9523.749 8568.500 7096.629 661.3148
## forest.MSEs 3880.790 3525.916 2621.800 384.6417
## svm.MSEs 7511.831 7655.972 5854.001 647.9083
## neural.MSEs 33104.730 4225.822 3407.284 412.3482
## gbm.MSEs 3506.468 3349.674 2561.061 354.4988
## lm.MSEs 10738.163 10684.486 7763.378 913.7805
As shown above, for registered users the random forest provides the best prediction (i.e. lowest MSE), and for casual users GBM yields the best result. Notice that the predicted results of random forest and GBM can contain unwanted negative values, so we manually convert them to 0.
This transformation matters because the final output we want to generate is the total number of rentals, derived from the predicted registered users (by random forest) and the predicted casual users (by GBM). Since each model’s output can include negative values, we applied the transformation before adding the predicted values. With this additional step we obtained the lowest MSE, which essentially makes this our ideal model.
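The clip-then-combine step described above can be sketched in a few lines. This is a minimal illustration in Python rather than the R code used in this analysis, and the prediction vectors are toy stand-ins for the test-set predictions:

```python
# Sketch: floor each component prediction at 0, sum the two user types,
# then compute the MSE against the actual total counts.
def combined_mse(actual_cnt, pred_registered, pred_casual):
    preds = [max(r, 0.0) + max(c, 0.0)  # clip negatives before combining
             for r, c in zip(pred_registered, pred_casual)]
    return sum((a - p) ** 2 for a, p in zip(actual_cnt, preds)) / len(actual_cnt)

# Toy example: the second registered prediction is negative and gets clipped.
print(combined_mse([10.0, 5.0], [8.0, -2.0], [3.0, 4.0]))  # → 1.0
```

Without the clipping, the negative prediction would drag the combined total down and inflate the squared error for that observation.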
After learning the relative influence (from GBM) of the predictors on “cnt”, “registered”, and “casual”, we attempted to visualize the story behind the numbers.
Important variables (relative influence):
- cnt: hr, yr, workingday, temp, mnth, weekday, …
- registered: hr, workingday, yr, mnth, weekday, …
- casual: hr, weekday, temp, workingday, …
Because of the similarities among these factors, we subjectively grouped them into 4 factor groups and plotted them against “cnt” (total users), “registered” (registered users), and “casual” (casual users).
Factor groups:
(1) hr
(2) yr, mnth
(3) workingday, weekday, holiday
(4) temp, atemp, hum
Given the number of attributes, we could not plot every pairwise combination. Instead, we show the most significant relationships between the predictors (e.g. hr, yr, …) and the responses (cnt, registered, casual).
# Ratio of 2 types of users
r.registered <- sum(hour.data$registered) / sum(hour.data$cnt)
r.casual <- (1- r.registered)
print(c("Registered %",r.registered ))
## [1] "Registered %" "0.811698316173547"
print(c("Causal %", r.casual))
## [1] "Causal %" "0.188301683826453"
Registered users account for the majority of rental usage. Because “hr” is the most significant factor, let's start with a plot of “hr” against “cnt”.
plot(x= hour.data$hr, y= hour.data$cnt)
From this plot we observe an interesting pattern: peak periods. The counts spike during 7-9am and 5-7pm, coinciding with the morning commute to work and the evening commute home. Let's look at finer-grained plots.
Since hour is the most significant factor for both registered and casual users, let's see how cnt, registered, and casual fluctuate with the hour.
Given that such rush-hour patterns exist, we divided the hour factor into 5 segments to better visualize how rental counts change over the course of a day for both registered and casual users.
# Create daypart column, default to "Night"
hour.data$daypart <- "Night"
# 0am -7am: "Early morning"
hour.data$daypart[(hour.data$hr >=0 ) & (hour.data$hr <7 )] <- "Early Morning"
# 7am- 9am : "Peak Morning"
hour.data$daypart[(hour.data$hr >=7 ) & (hour.data$hr <9 )] <- "Peak Morning"
# 9am- 5pm : "Day"
hour.data$daypart[(hour.data$hr >=9 ) & (hour.data$hr <17 )] <- "Day"
# 5pm- 8pm : "Peak Evening"
hour.data$daypart[(hour.data$hr >=17 ) & (hour.data$hr <20 )] <- "Peak Evening"
# Factorization
hour.data$daypart <- factor(hour.data$daypart)
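As a side note, the same segmentation can be written more compactly with base R's `cut()`. The sketch below is equivalent to the comparisons above, applied to a standalone hour vector rather than hour.data:

```r
hr <- 0:23
# Each break is exclusive on the left, inclusive on the right:
# (-1,6] = 0-6, (6,8] = 7-8, (8,16] = 9-16, (16,19] = 17-19, (19,23] = 20-23
daypart2 <- cut(hr,
                breaks = c(-1, 6, 8, 16, 19, 23),
                labels = c("Early Morning", "Peak Morning", "Day",
                           "Peak Evening", "Night"))
table(daypart2)
```

`cut()` also returns a factor directly, so the separate factorization step is not needed.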
# Count by hour
g.cnt.hr <- ggplot(hour.data, aes(x = hr, y = cnt))
g.cnt.hr + geom_point(aes(color = daypart)) + ggtitle("Total Rental by Hour")
The plot shows two peaks during the morning (7-9am) and evening (5-8pm) rush hours. Let's break this down into registered and casual users.
# Registered by hour
g.registered.hr <- ggplot(hour.data, aes(x = hr, y = registered))
g.registered.hr + geom_point(aes(color = daypart)) + ggtitle("Registered Rental by Hour")
The peak hours are even more pronounced for registered users. Apparently, many registered users commute to work by rental bike.
# Casual by hour
g.casual.hr <- ggplot(hour.data, aes(x = hr, y = casual))
g.casual.hr + geom_point(aes(color = daypart)) + ggtitle("Casual Rental by Hour")
Peak hours have little impact on casual users, which implies that people who commute by rental bike are mostly registered users. Also, many casual users tend to use the service in the afternoon, which may correlate with temperature or other weather factors (it gets warmer from around 11am). We examine this hypothesis, the relationship between temperature and casual users, later.
# Monthly total rental fluctuation in two years
year <- function(x) {
  # Map yr (0/1) to the calendar year
  if (x == 0) 2011 else 2012
}
hour.data$year <- factor(sapply(hour.data$yr, year))
g.cnt.mnth <- ggplot(hour.data, aes(as.numeric(mnth), as.numeric(cnt), colour = as.factor(year)))
g.cnt.mnth + geom_smooth(se = FALSE, method = "auto") + ggtitle("Monthly Total Rental Over Two Years")
Ridership increased significantly in 2012. Furthermore, since 81% of the users are registered, we assumed that the ridership of registered users went up drastically. Let's evaluate that assumption as follows.
# Monthly registered rental fluctuation in two years
g.registered.mnth <- ggplot(hour.data, aes(as.numeric(mnth), as.numeric(registered), colour = as.factor(year)))
g.registered.mnth + geom_smooth(se = FALSE, method = "auto") + ggtitle("Monthly Registered Rental Over Two Years")
Not only did the ridership of registered users increase significantly, there is also an interesting pattern. While registered ridership in the first 7 months of 2011 grew steadily, there appears to be a jump starting in August 2011. This might result from new policies or other environmental factors.
Notice also that usage is generally lower in winter, which may be related to lower temperatures.
# Monthly casual rental fluctuation in two years
g.casual.mnth <- ggplot(hour.data, aes(as.numeric(mnth), as.numeric(casual), colour = as.factor(year)))
g.casual.mnth + geom_smooth(se = FALSE, method = "auto") + ggtitle("Monthly Casual Rental Over Two Years")
There are far more casual users in summer and fall than in spring and winter. This also explains why casual users are more affected by environmental conditions than registered users.
Recall the previous graph: registered users' usage curves are flatter than casual users', because registered users who commute by bike ride regularly and are relatively insensitive to the month.
# Count by hour on workingday
g.cnt.workday <- ggplot(hour.data, aes(x = hr, y = cnt, fill = as.factor(workingday)))
g.cnt.workday + geom_bar(stat = "identity", position="dodge") + ggtitle("Total Rental by Workingday")
1 denotes a working day, while 0 denotes a non-working day. On working days the peak hours are very pronounced, while on non-working days many people use the rental bikes in the afternoon. One interpretation is that on non-working days, people use the service for leisure.
# Registered on workingday
g.registered.workday <- ggplot(hour.data, aes(x = hr, y = registered, fill = as.factor(workingday)))
g.registered.workday + geom_bar(stat = "identity", position="dodge") + ggtitle("Registered Rental by Workingday")
On working days, most registered users use rental bikes during peak periods, meaning that, again, most registered users are bike commuters. On non-working days, their usage fluctuates less. Let's quantify this:
library(data.table)
##
## Attaching package: 'data.table'
## The following object is masked _by_ '.GlobalEnv':
##
## year
# Subsetting
sub.hour.data <- hour.data[,c("daypart", "cnt", "registered", "casual")]
# Create data table
dt <-as.data.table(sub.hour.data)
# Extract rows where daypart is "Peak Morning" or "Peak Evening"
dt.peak <- dt[daypart %in% c("Peak Morning", "Peak Evening")]
# Share of registered rides that occur during peak hours (42.3%)
dt.peak[, sum(registered)] / dt[, sum(registered)]
## [1] 0.4230142
Notice that 81% of users are registered and 42.3% of registered rides occur during peak hours, indicating that management needs to pay special attention to bike allocation during peak hours.
# Casual on workingday
g.casual.workday <- ggplot(hour.data, aes(x = hr, y = casual, fill = as.factor(workingday)))
g.casual.workday + geom_bar(stat = "identity", position="dodge") + ggtitle("Casual Rental by Workingday")
Since registered users account for 81% of rentals, casual users account for the remaining 19%. Unlike registered users, casual users show a similar pattern on working and non-working days. Notably, casual users tend to use the service on non-working days, especially during the daytime when outdoor activity is high. Alternatively, they may simply be unable to get enough bikes on working days due to commuter demand.
# Count on weekday
g.cnt.hr.byweekday <- ggplot(hour.data, aes(as.numeric(hr), as.numeric(cnt), colour = as.factor(weekday)))
g.cnt.hr.byweekday + geom_smooth(se = FALSE, method = "auto") + ggtitle("Total Rentals vs Hour (Sunday to Saturday)")
Rentals from Monday to Friday fall into one pattern, while rentals on Saturday and Sunday fall into another. This matches the previous graph, suggesting that management should handle bike allocation very differently on weekdays versus weekends.
Let’s evaluate our assumption:
# Check the proportion of registered & casual users on working and non-working days
sub.hour.data <- hour.data[,c("workingday", "cnt", "registered", "casual")]
dt <-as.data.table(sub.hour.data)
dt.wd1 <- dt[workingday == 1]
dt.wd0 <- dt[workingday == 0]
# On a working day, 87% of users are registered
dt.wd1[, sum(registered)] /dt.wd1[, sum(cnt)]
## [1] 0.8677004
# While on a non-working day, casual users account for 32% of total ridership
dt.wd0[, sum(casual)] /dt.wd0[, sum(cnt)]
## [1] 0.3166468
The business insight: on working days, management should focus on providing the best service for registered users.
Conversely, on non-working days, although registered users still outnumber casual users, demand from casual users becomes important, especially from 12pm to 5pm. The casual share rises from 13% on a working day to 32% on a non-working day.
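The 13% figure follows directly from the working-day registered share computed above, since the two shares sum to 1:

```r
# Working-day registered share from the previous chunk (value reproduced here)
registered.share.workingday <- 0.8677
casual.share.workingday <- 1 - registered.share.workingday  # about 0.13
```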
Below is a picture of how temperature changes over the course of a year.
# mnth vs temp
g.temp.mnth <- ggplot(hour.data, aes(as.numeric(mnth), temp))
g.temp.mnth + geom_smooth(se = FALSE, method = "auto") + ggtitle("Temperature Fluctuation in a Year")
As discussed above, we assumed that registered users are less affected by environmental conditions. Let's see:
# Registered on temp
g.registered.temp <- ggplot(hour.data, aes(x = temp, y = registered))
g.registered.temp + geom_point()
Unsurprisingly, the number of rentals by registered users does not change much with temperature, which supports our assumption.
# Casual on temp
g.casual.temp <- ggplot(hour.data, aes(x = temp, y = casual))
g.casual.temp + geom_point()
Casual users are more sensitive to temperature than registered users. Usage is much higher between roughly 20 and 30 degrees Celsius (after de-normalizing temp). Interestingly, casual users seem to find temperatures above 30 degrees unbearable, and casual ridership drops sharply there.
# Registered/casual on feeled temp
hour.data$raw.atemp <- hour.data$atemp*50
g.temp2 <- ggplot(hour.data, aes(x = raw.atemp, y = registered))
g.temp2 + geom_point()
g.temp2 <- ggplot(hour.data, aes(x = raw.atemp, y = casual))
g.temp2 + geom_point()
The atemp plot is similar to the temp plot.
# Registered/casual on humidity
g.registered.hum <- ggplot(hour.data, aes(x = hum, y = registered))
g.registered.hum + geom_point()
g.casual.hum <- ggplot(hour.data, aes(x = hum, y = casual))
g.casual.hum + geom_point()
Casual users are somewhat more sensitive to humidity than registered users, but casual usage is fairly smooth except at extreme humidity (e.g. during heavy rain). Overall, biking activity is relatively insensitive to humidity.
The plots confirm our assumption that hr, mnth, workingday, temp, and hum have the strongest correlations with cnt/registered/casual. We also have the following interesting findings:
1) Most registered users commute to work by rental bike, while casual users do not.
2) Ridership increased from 2011 to 2012, driven mostly by registered users.
3) The hourly usage pattern differs greatly between working days and non-working days.
4) Casual users are more sensitive to weather conditions than registered users.
The bike-sharing system should allocate bikes with these facts in mind.
[1] Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge”, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.
[2] https://rpubs.com/saitej09/bikesharing